INTRODUCTION

This code through analyzes a flight dataset and predicts flight delays using classification models.

From this code through, you can see how to wrangle, clean, and transform a dataset to prepare it for the models that are used.

This code through also introduces the models and explains their results with confusion matrices, which visualize and summarize the performance of a classification algorithm.

Content Overview

  • First, I introduce the dataset and the variables it contains.

  • Second, in the exploratory data analysis section, I examine the total delay minutes by reason for each airline, and the percentage of departure delays by day of the week and on weekends for each airline.

  • Third, three classification models are applied: logistic regression, K-nearest neighbors (KNN), and Naive Bayes. Flight delays are predicted from these models, and the performance of the models is analyzed.


DATA

The dataset is raw data collected from the Bureau of Transportation Statistics, which collects and publishes comprehensive transportation statistics. The dataset contains all flights departing from the Atlanta airport in January 2022 and April 2022 for three selected airlines: Delta Air Lines, Spirit Airlines, and Frontier Airlines.

There are 70,855 rows in this dataset and 17 variables describing each flight, including the carrier code, month, day, flight number, origin airport, destination airport, airplane information, departure time, arrival time, and delays (departure, weather, security, etc.).

Let's check the variables:

For your reference, the airline abbreviations used in the dataset are as follows:

  • NK: Spirit Airlines
  • DL: Delta Air Lines
  • F9: Frontier Airlines


Exploratory Data Analysis

In this section, you will see which days of the week, and weekends in particular, have the most flight delays for each airline.

First, let's add a Weekend variable to the dataset:

I used the grepl() function, which searches a string for matches of a character pattern and returns TRUE if the string contains the pattern and FALSE otherwise. Since Saturday and Sunday both start with “S”, I used this condition to add the Weekend variable with the following code:

dataset$Weekend <- grepl("S.+", weekdays(dataset$`Date (MM/DD/YYYY)`))

# Equivalent dplyr version (note the result must be assigned back):
dataset <- dataset %>%
  mutate(Weekend = grepl("S.+", weekdays(`Date (MM/DD/YYYY)`)))


Second, let's check the sum of all delay reasons per airline.

To find this, I used the pivot_longer() function, which pivots a data frame from a wide format to a long format so that each column containing a “Delay” reason is pivoted into its own rows. After this, I used group_by() and summarise() on carrier code and delay category to compute the total delay minutes. Lastly, I plotted the graph with the ggplot2 package.

dataset %>%
  pivot_longer(cols=contains("Delay"),
                    names_to='Delay Reasons',
                    values_to='Delays') %>%
  group_by(`Carrier Code`,`Delay Reasons`)%>%
  summarise(Delays = sum(Delays))%>%
ggplot(aes(x=Delays,y=`Delay Reasons`, fill=`Carrier Code` ))+
  geom_col()+
  labs(x="Sum of Departure Delays(mins) ",title = "Sum of Delays by Delay Reasons per Airline ")+
  theme_bw()+
  theme(plot.title = element_text(hjust = 0.5))+
  theme(
    axis.title.x = element_text(family="serif", color="black",size = 14, face = "bold"),
    axis.title.y = element_text(family="serif", color="black",size = 14, face = "bold"),
    plot.title = element_text(family="serif",color="black", size=16, face="bold"),
    text=element_text(family="serif",size=14))+
  theme(legend.text = element_text(family="serif",size = 10))+
  theme(legend.title = element_text(family="serif",face = "bold",size=12))


As the graph shows, departure delay accounts for the largest number of minutes among all delay categories for each airline, and security delay accounts for the fewest. Thus, I decided to focus on departure delay for this analysis.


Next, to analyze flight departure delays, I created an is_delay variable: a flight whose departure delay is 15 minutes or more is considered delayed.

I applied the ifelse() function to create a dummy variable taking the values 0 and 1, where 1 means delayed and 0 means not delayed.

dataset$is_delay <- ifelse(dataset$`Departure delay (Minutes)` >= 15, 1, 0)

# Equivalent dplyr version (note the result must be assigned back):
dataset <- dataset %>%
  mutate(is_delay = ifelse(`Departure delay (Minutes)` >= 15, 1, 0))


Now, let's check the percentage of departure delays by day of the week for each airline:

I grouped the dataset by airline and day and calculated the percentage of delays with the count(), group_by(), and sum() functions. Then I used ggplot to visualize the percentage of departure delays by day for each airline.

 dataset %>% 
  count(`Carrier Code`, Day, is_delay) %>% 
  group_by(`Carrier Code`, Day) %>%
  mutate(percent_delay = (n/sum(n)*100) %>% round(1)) %>% 
  filter(is_delay==1) %>%
  ggplot(aes(x = factor(Day), y = percent_delay, group=`Carrier Code`, color =`Carrier Code`)) + 
  geom_line(size=1) +
  labs(title = "Percentage of Departure Delays by the Days of the Week for Each Airline", y = "Percentage of Delays", x = "Day of the Week", caption = "**1:Mon | 2:Tue | 3:Wed | 4:Thu | 5:Fri | 6:Sat | 7:Sun**")+
  theme_bw()+
  theme(plot.title = element_text(hjust = 0.5))+
  theme(
    axis.title.x = element_text(family="serif",color = "black", size = 10, face = "bold"),
    axis.title.y = element_text(family="serif",color = "black", size = 10, face = "bold"),
    plot.title = element_text(family="serif",color="black", size=14, face="bold"),
    text=element_text(family="serif"))+
  theme(legend.text = element_text(family="serif",size = 10))+
  theme(legend.title = element_text(family="serif",face = "bold",size=12))+
  theme(
    legend.position = c(1, .1),
    legend.justification = c("right", "bottom"),
    legend.box.just = "right",
    legend.margin = margin(6, 6, 6, 6),
    legend.background = element_rect(fill = "white", colour = "black"))


The graph shows that the highest percentage of departure delays falls on Saturday for each airline. Among the three airlines, Delta Air Lines generally has the lowest delay percentage.


Lastly, let's find out which airline has the highest percentage of departure delays on weekends:

Again, to find which airline has the highest percentage of departure delays on weekends, I grouped by the carrier code and Weekend variables and calculated the delay percentage with the sum() function. Then I created a pie chart with ggplot().

dataset %>% 
  count(`Carrier Code`, Weekend, is_delay) %>% 
  group_by(`Carrier Code`, Weekend) %>%
  mutate(percent_delay = (n/sum(n)*100) %>% round(1)) %>% 
  filter(is_delay==1 & Weekend==1) %>%
ggplot(aes(x = "", y = percent_delay, fill = fct_inorder(`Carrier Code`))) +
  geom_col(width = 1, color = 1) +
  geom_text(aes(label = paste0(percent_delay, "%")),
            position = position_stack(vjust = 0.5),
            size=4,family="serif") +
  coord_polar(theta = "y") +
  labs(title="The percentage of Flight Departure Delays on Weekends per Airline")+
  scale_fill_brewer(palette = "Pastel1")+
  guides(fill = guide_legend(title = "Airlines")) +
  theme_void()+
  theme(axis.line = element_blank(),
        axis.text = element_blank(),
        axis.ticks = element_blank(),
        plot.title = element_text(hjust = 0.5,vjust=-1),
        text=element_text(family="serif"))+
  theme(legend.text = element_text(family="serif",size = 10))+
  theme(legend.title = element_text(family="serif",face = "bold",size=12))

According to the pie chart, the highest percentage of departure delays on weekends is 28.4% at Frontier Airlines, while the lowest is 19.7% at Delta Air Lines.


METHODOLOGY

To predict the flight delays, three methods are used: the logistic regression model, the K-nearest neighbors model (KNN), and Naive Bayes.

  • Logistic regression estimates the probability of an event occurring, based on a given dataset of independent variables.
  • KNN uses proximity to make classifications or predictions about the grouping of an individual data point.
  • Naive Bayes is a probabilistic machine learning algorithm based on Bayes' theorem and is used in a wide variety of classification tasks.
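As a quick illustration of the first method: logistic regression passes a linear combination of the predictors through the logistic (sigmoid) function to obtain a probability. A minimal base-R sketch with made-up coefficients (not the ones fitted later in this code through):

```r
# Logistic function: maps any real number to a probability in (0, 1)
sigmoid <- function(z) 1 / (1 + exp(-z))

# Hypothetical coefficients and one observation's predictor values
beta <- c(intercept = -0.5, month = 0.2, day = 0.03)
x    <- c(1, 4, 6)  # 1 for the intercept, Month = 4, Day = 6

# Predicted probability of a delay for this observation
p_delay <- sigmoid(sum(beta * x))
round(p_delay, 3)  # ≈ 0.618
```

With a threshold of 0.5, this observation would be classified as a delay, which is exactly how the fitted model's probabilities are turned into 0/1 predictions later on.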


Data Preparation & Analysis

Before fitting the classification models, I screened the candidate predictors by fitting a linear model with the lm() function. Then I applied the summary() function to interpret the most important statistical values for the analysis.

correlation <-
lm(formula = `Departure delay (Minutes)` ~ Month + Day + Weekend + `Flight Number`, data=dataset)
summary(correlation)
## 
## Call:
## lm(formula = `Departure delay (Minutes)` ~ Month + Day + Weekend + 
##     `Flight Number`, data = dataset)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
##  -43.40  -13.35   -9.79   -2.04 1046.10 
## 
## Coefficients:
##                   Estimate Std. Error t value             Pr(>|t|)    
## (Intercept)      3.7055366  0.5394328   6.869   0.0000000000065039 ***
## Month            2.1479429  0.1260504  17.040 < 0.0000000000000002 ***
## Day              0.6157975  0.0713288   8.633 < 0.0000000000000002 ***
## WeekendTRUE      0.6683261  0.3174769   2.105               0.0353 *  
## `Flight Number` -0.0012528  0.0001672  -7.492   0.0000000000000688 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 37.71 on 70850 degrees of freedom
## Multiple R-squared:  0.00624,    Adjusted R-squared:  0.006184 
## F-statistic: 111.2 on 4 and 70850 DF,  p-value: < 0.00000000000000022

As the output above shows, the variables I will use are all statistically significant (although the R-squared is very small, so these predictors explain little of the variance on their own).


Logistic Regression

For logistic regression I selected the following variables:

log_reg <- dataset %>%
select(`Carrier Code`, `Date (MM/DD/YYYY)`, Month, Day, Weekend, `Flight Number`, is_delay)

head(log_reg)
log_reg$is_delay <- as.factor(log_reg$is_delay)

The target variable, is_delay, is converted to a factor for the analysis.


First, the dataset is split into a training set (70%) and a test set (30%) with the following code:

#Splitting

set.seed(1234)
index<-createDataPartition(log_reg$is_delay,p=0.7,list=FALSE)
train<- log_reg[index,]
test<- log_reg[-index,]

table(train$is_delay)
## 
##     0     1 
## 40316  9283

set.seed() makes the random split reproducible. The createDataPartition() function splits the data into 70% training and 30% test sets, which are assigned to train and test.

The class sizes are checked with the table() function; there are 40,316 non-delays and 9,283 delays.


Next, to balance the classes, the sample sizes of delays and non-delays are equalized with the downSample() function, which randomly samples the majority class down to the size of the minority class.
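The mechanism behind downsampling can be sketched in base R with a small hypothetical label vector (just to show the idea; the analysis itself uses caret's downSample() as shown below):

```r
set.seed(1)
# Hypothetical imbalanced labels: 40 zeros (majority) and 10 ones (minority)
y <- factor(c(rep(0, 40), rep(1, 10)))

minority_n <- min(table(y))              # size of the smaller class
idx <- unlist(lapply(levels(y), function(lv) {
  sample(which(y == lv), minority_n)     # random sample of each class
}))
balanced <- y[idx]
table(balanced)                          # both classes now have 10 observations
```

downSample() does the same thing while also carrying along the predictor columns and returning the label in a column named Class.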

#Downsample
set.seed(111)
traindown<-downSample(x=train[,-ncol(train)],
                      y=train$is_delay)
table(traindown$Class)
## 
##    0    1 
## 9283 9283

As we can see, there are now 9,283 delays and 9,283 non-delays.


Then, the logistic regression model is fit with the dependent and independent variables and assigned to modeldown with the following code:

# Model
glm(Class ~ `Carrier Code` + Month + Day + Weekend +  `Flight Number`, data=traindown, family=binomial(link = "logit"))
## 
## Call:  glm(formula = Class ~ `Carrier Code` + Month + Day + Weekend + 
##     `Flight Number`, family = binomial(link = "logit"), data = traindown)
## 
## Coefficients:
##      (Intercept)  `Carrier Code`F9  `Carrier Code`NK             Month  
##      -0.55370332        0.64325330        0.55576311        0.19834226  
##              Day       WeekendTRUE   `Flight Number`  
##       0.03119752        0.13448737       -0.00009894  
## 
## Degrees of Freedom: 18565 Total (i.e. Null);  18559 Residual
## Null Deviance:       25740 
## Residual Deviance: 25330     AIC: 25340
modeldown <- glm(Class ~ `Carrier Code` + Month + Day + Weekend + `Flight Number`, data=traindown, family=binomial(link = "logit"))

After fitting the model, the predict() function is used to predict the flight departure delays, and a threshold of 0.5 is applied: each row whose predicted probability is greater than 0.5 is classified as a delay.

# Predicting
predicted <- predict(modeldown, test, type="response")

#Threshold
predicted <-ifelse(predicted> 0.5,1,0)

predicted <- as.factor(predicted)

The predicted values are converted to factors for the analysis.


Lastly, the confusion matrix is created with the following code:

confusionMatrix(predicted, test$is_delay)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction    0    1
##          0 9693 1837
##          1 7585 2141
##                                              
##                Accuracy : 0.5567             
##                  95% CI : (0.55, 0.5634)     
##     No Information Rate : 0.8129             
##     P-Value [Acc > NIR] : 1                  
##                                              
##                   Kappa : 0.0638             
##                                              
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.5610             
##             Specificity : 0.5382             
##          Pos Pred Value : 0.8407             
##          Neg Pred Value : 0.2201             
##              Prevalence : 0.8129             
##          Detection Rate : 0.4560             
##    Detection Prevalence : 0.5424             
##       Balanced Accuracy : 0.5496             
##                                              
##        'Positive' Class : 0                  
## 


The confusion matrix is explained in the Findings section.

KNN

For KNN I selected the following variables.

knn <- dataset %>%
select(`Carrier Code`, Month, Day, Weekend, `Flight Number`, is_delay)

head(knn)
knn$is_delay <- as.factor(knn$is_delay)
knn$Weekend <- as.numeric(knn$Weekend)
knn$`Carrier Code`  = as.numeric(as.factor(knn$`Carrier Code`))

The target variable, is_delay, is converted to a factor, and Weekend and Carrier Code are converted to numeric so that all the predictors can be scaled.


First, the dataset is split into a training set (70%) and a test set (30%) with the following code:

#Splitting

set.seed(1234)
to_take <- floor(0.70* nrow(knn))

set.seed(111)
train_idx <- sample(seq_len(nrow(knn)), size = to_take)
train2 <- knn[train_idx, ]
test2 <- knn[-train_idx, ]

table(train2$is_delay)
## 
##     0     1 
## 40353  9245
train2$`Carrier Code` <- as.numeric(as.factor(train2$`Carrier Code`))
test2$`Carrier Code` <- as.numeric(as.factor(test2$`Carrier Code`))

train_scale <- scale(train2[, 1:5])
test_scale <- scale(test2[, 1:5])

set.seed() makes the random draw reproducible. Here the split is done with sample(): 70% of the row indices are drawn for the training set and the remaining rows form the test set. The predictors are then standardized with scale(). (Note that scaling the training and test sets separately is a simplification; strictly, the test set should be scaled with the training set's centering and scaling parameters.)

The class sizes are checked with the table() function; there are 40,353 non-delays and 9,245 delays.


Next, to balance the classes, the sample sizes of delays and non-delays are equalized with the downSample() function, which randomly samples the majority class down to the size of the minority class.

#Downsample
set.seed(111)
traindown2 <- downSample(x = train_scale,
                         y = train2$is_delay)
table(traindown2$Class)
## 
##    0    1 
## 9245 9245

As we can see, there are now 9,245 delays and 9,245 non-delays.


Then, the KNN model is fit on the scaled predictors and assigned to classifier_knn with the following code:

# Model

classifier_knn <- knn(train = traindown2[, -ncol(traindown2)],  # predictors only; Class is the label
                      test  = test_scale,
                      cl    = traindown2$Class,
                      k     = 1)

After fitting the model, the confusion matrix is created with the following code:

#Confusion Matrix
confusionMatrix(classifier_knn, test2$is_delay)
## Confusion Matrix and Statistics
## 
##           Reference
## Prediction     0     1
##          0 10046  2536
##          1  7195  1480
##                                              
##                Accuracy : 0.5422             
##                  95% CI : (0.5355, 0.5489)   
##     No Information Rate : 0.8111             
##     P-Value [Acc > NIR] : 1                  
##                                              
##                   Kappa : -0.0338            
##                                              
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.5827             
##             Specificity : 0.3685             
##          Pos Pred Value : 0.7984             
##          Neg Pred Value : 0.1706             
##              Prevalence : 0.8111             
##          Detection Rate : 0.4726             
##    Detection Prevalence : 0.5919             
##       Balanced Accuracy : 0.4756             
##                                              
##        'Positive' Class : 0                  
## 


The confusion matrix is explained in the Findings section.

Naive Bayes

For Naive Bayes I selected the following variables.

NB <- dataset %>%
select(`Carrier Code`, Month, Day, Weekend, `Flight Number`, is_delay)

NB$is_delay <- as.factor(NB$is_delay)
NB$Weekend <- as.numeric(NB$Weekend)
NB$`Carrier Code`  = as.numeric(as.factor(NB$`Carrier Code`))
   
head(NB)

The target variable, is_delay, is converted to a factor, and Weekend and Carrier Code are converted to numeric for the analysis.


First, the dataset is split into a training set (70%) and a test set (30%) with the following code:

#Splitting

set.seed(1234)
sample <- sample(c(TRUE, FALSE), nrow(NB), replace=TRUE, prob=c(0.7,0.3))
train3 <- NB[sample, ]
test3 <- NB[!sample, ]

table(train3$is_delay)
## 
##     0     1 
## 40162  9299

set.seed() makes the random draw reproducible. Here the split is done with sample(): each row is assigned to the training set with probability 0.7 and to the test set otherwise.

The class sizes are checked with the table() function; there are 40,162 non-delays and 9,299 delays.


Next, to balance the classes, the sample sizes of delays and non-delays are equalized with the downSample() function, which randomly samples the majority class down to the size of the minority class.

#Downsample
set.seed(111)
traindown3<-downSample(x=train3[,-ncol(train3)],
                      y=train3$is_delay)
table(traindown3$Class)
## 
##    0    1 
## 9299 9299

As we can see, there are now 9,299 delays and 9,299 non-delays.


Then, the Naive Bayes model is fit with the dependent and independent variables and assigned to classifier_cl with the following code:

# Fitting Naive Bayes Model
# to training dataset

set.seed(120)  # Setting Seed
classifier_cl <- naiveBayes(Class ~ ., data = traindown3)
classifier_cl
## 
## Naive Bayes Classifier for Discrete Predictors
## 
## Call:
## naiveBayes.default(x = X, y = Y, laplace = laplace)
## 
## A-priori probabilities:
## Y
##   0   1 
## 0.5 0.5 
## 
## Conditional probabilities:
##    Carrier Code
## Y       [,1]      [,2]
##   0 1.089795 0.3885904
##   1 1.148188 0.4852965
## 
##    Month
## Y       [,1]     [,2]
##   0 2.493709 1.114077
##   1 2.722766 1.117206
## 
##    Day
## Y       [,1]     [,2]
##   0 3.988493 1.974939
##   1 4.083020 2.065498
## 
##    Weekend
## Y        [,1]      [,2]
##   0 0.2683084 0.4431028
##   1 0.2990644 0.4578727
## 
##    Flight Number
## Y       [,1]     [,2]
##   0 1730.418 857.7741
##   1 1656.907 848.7378

After fitting the model, the predict() function is applied to the test dataset to predict the flight departure delays:

# Predicting on test dataset
y_pred <- predict(classifier_cl, newdata = test3)

Lastly, the confusion matrix is created with the following code:

# Confusion Matrix
cm3 <- table(y_pred, test3$is_delay)

# Model Evaluation
cm3 <- confusionMatrix(cm3)
cm3
## Confusion Matrix and Statistics
## 
##       
## y_pred     0     1
##      0 13794  2832
##      1  3638  1130
##                                              
##                Accuracy : 0.6976             
##                  95% CI : (0.6914, 0.7037)   
##     No Information Rate : 0.8148             
##     P-Value [Acc > NIR] : 1                  
##                                              
##                   Kappa : 0.0709             
##                                              
##  Mcnemar's Test P-Value : <0.0000000000000002
##                                              
##             Sensitivity : 0.7913             
##             Specificity : 0.2852             
##          Pos Pred Value : 0.8297             
##          Neg Pred Value : 0.2370             
##              Prevalence : 0.8148             
##          Detection Rate : 0.6448             
##    Detection Prevalence : 0.7771             
##       Balanced Accuracy : 0.5383             
##                                              
##        'Positive' Class : 0                  
## 


The confusion matrix is explained in the Findings section.


FINDINGS

Logistic Regression

#Confusion Matrix

cm <- confusionMatrix(predicted, test$is_delay)

draw_confusion_matrix <- function(cm) {

  total <- sum(cm$table)
  res <- as.numeric(cm$table)

  # Generate color gradients. Palettes come from RColorBrewer.
  greenPalette <- c("#F7FCF5","#E5F5E0","#C7E9C0","#A1D99B","#74C476","#41AB5D","#238B45","#006D2C","#00441B")
  redPalette <- c("#FFF5F0","#FEE0D2","#FCBBA1","#FC9272","#FB6A4A","#EF3B2C","#CB181D","#A50F15","#67000D")
  getColor <- function (greenOrRed = "green", amount = 0) {
    if (amount == 0)
      return("#FFFFFF")
    palette <- greenPalette
    if (greenOrRed == "red")
      palette <- redPalette
    colorRampPalette(palette)(100)[10 + ceiling(90 * amount / total)]
  }

  # set the basic layout
  layout(matrix(c(1,1,2)))
  par(mar=c(2,2,2,2))
  plot(c(100, 345), c(300, 450), type = "n", xlab="", ylab="", xaxt='n', yaxt='n')
  title('CONFUSION MATRIX - Logistic Regression', cex.main=2)

  # create the matrix 
  classes = colnames(cm$table)
  rect(150, 430, 240, 370, col=getColor("green", res[1]))
  text(195, 435, classes[1], cex=1.2)
  rect(250, 430, 340, 370, col=getColor("red", res[3]))
  text(295, 435, classes[2], cex=1.2)
  text(125, 370, 'Predicted', cex=1.3, srt=90, font=2)
  text(245, 450, 'Actual', cex=1.3, font=2)
  rect(150, 305, 240, 365, col=getColor("red", res[2]))
  rect(250, 305, 340, 365, col=getColor("green", res[4]))
  text(140, 400, classes[1], cex=1.2, srt=90)
  text(140, 335, classes[2], cex=1.2, srt=90)

  # add in the cm results
  text(195, 400, res[1], cex=1.6, font=2, col='white')
  text(195, 335, res[2], cex=1.6, font=2, col='white')
  text(295, 400, res[3], cex=1.6, font=2, col='white')
  text(295, 335, res[4], cex=1.6, font=2, col='white')

  # add in the specifics 
  plot(c(100, 0), c(100, 0), type = "n", xlab="", ylab="", main = "DETAILS", xaxt='n', yaxt='n')
  text(10, 85, names(cm$byClass[1]), cex=1.2, font=2)
  text(10, 70, round(as.numeric(cm$byClass[1]), 3), cex=1.2)
  text(30, 85, names(cm$byClass[2]), cex=1.2, font=2)
  text(30, 70, round(as.numeric(cm$byClass[2]), 3), cex=1.2)
  text(50, 85, names(cm$byClass[5]), cex=1.2, font=2)
  text(50, 70, round(as.numeric(cm$byClass[5]), 3), cex=1.2)
  text(70, 85, names(cm$byClass[6]), cex=1.2, font=2)
  text(70, 70, round(as.numeric(cm$byClass[6]), 3), cex=1.2)
  text(90, 85, names(cm$byClass[7]), cex=1.2, font=2)
  text(90, 70, round(as.numeric(cm$byClass[7]), 3), cex=1.2)

  # add in the accuracy information 
  text(30, 35, names(cm$overall[1]), cex=1.5, font=2)
  text(30, 20, round(as.numeric(cm$overall[1]), 3), cex=1.4)
  text(70, 35, names(cm$overall[2]), cex=1.5, font=2)
  text(70, 20, round(as.numeric(cm$overall[2]), 3), cex=1.4)
}

draw_confusion_matrix(cm)

Precision (positive predictive value) shows what share of the flights predicted to be in the positive class actually belong to it. Sensitivity shows what share of the actual positives the model identified, while specificity shows what share of the actual negatives were correctly classified. Note that caret treats 0 (non-delay) as the positive class in these outputs, so sensitivity here refers to non-delays and specificity to delays.
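These metrics can be computed directly from the cell counts in the confusion matrix reported above. A base-R check using the logistic regression counts (positive class = 0, non-delay):

```r
# Counts from the logistic regression confusion matrix
# (rows = predicted, columns = actual; positive class is 0 / non-delay)
TP <- 9693  # predicted 0, actual 0
FP <- 1837  # predicted 0, actual 1
FN <- 7585  # predicted 1, actual 0
TN <- 2141  # predicted 1, actual 1

accuracy    <- (TP + TN) / (TP + FP + FN + TN)
precision   <- TP / (TP + FP)   # Pos Pred Value in the caret output
sensitivity <- TP / (TP + FN)
specificity <- TN / (TN + FP)
round(c(accuracy, precision, sensitivity, specificity), 4)
# 0.5567 0.8407 0.5610 0.5382 — matching the caret output above
```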

According to the confusion matrix of the logistic regression, the model's accuracy is 55.7%, which is modest. Precision is 84%: most flights predicted as non-delays were indeed not delayed. Sensitivity is 56% (non-delays correctly identified), and specificity is 54% (delays correctly identified).

KNN

cm2 <- confusionMatrix(classifier_knn, test2$is_delay)

draw_confusion_matrix <- function(cm2) {

  total <- sum(cm2$table)
  res <- as.numeric(cm2$table)

  # Generate color gradients. Palettes come from RColorBrewer.
  greenPalette <- c("#F7FCF5","#E5F5E0","#C7E9C0","#A1D99B","#74C476","#41AB5D","#238B45","#006D2C","#00441B")
  redPalette <- c("#FFF5F0","#FEE0D2","#FCBBA1","#FC9272","#FB6A4A","#EF3B2C","#CB181D","#A50F15","#67000D")
  getColor <- function (greenOrRed = "green", amount = 0) {
    if (amount == 0)
      return("#FFFFFF")
    palette <- greenPalette
    if (greenOrRed == "red")
      palette <- redPalette
    colorRampPalette(palette)(100)[10 + ceiling(90 * amount / total)]
  }

  # set the basic layout
  layout(matrix(c(1,1,2)))
  par(mar=c(2,2,2,2))
  plot(c(100, 345), c(300, 450), type = "n", xlab="", ylab="", xaxt='n', yaxt='n')
  title('CONFUSION MATRIX - KNN', cex.main=2)

  # create the matrix 
  classes = colnames(cm2$table)
  rect(150, 430, 240, 370, col=getColor("green", res[1]))
  text(195, 435, classes[1], cex=1.2)
  rect(250, 430, 340, 370, col=getColor("red", res[3]))
  text(295, 435, classes[2], cex=1.2)
  text(125, 370, 'Predicted', cex=1.3, srt=90, font=2)
  text(245, 450, 'Actual', cex=1.3, font=2)
  rect(150, 305, 240, 365, col=getColor("red", res[2]))
  rect(250, 305, 340, 365, col=getColor("green", res[4]))
  text(140, 400, classes[1], cex=1.2, srt=90)
  text(140, 335, classes[2], cex=1.2, srt=90)

  # add in the cm2 results
  text(195, 400, res[1], cex=1.6, font=2, col='white')
  text(195, 335, res[2], cex=1.6, font=2, col='white')
  text(295, 400, res[3], cex=1.6, font=2, col='white')
  text(295, 335, res[4], cex=1.6, font=2, col='white')

  # add in the specifics 
  plot(c(100, 0), c(100, 0), type = "n", xlab="", ylab="", main = "DETAILS", xaxt='n', yaxt='n')
  text(10, 85, names(cm2$byClass[1]), cex=1.2, font=2)
  text(10, 70, round(as.numeric(cm2$byClass[1]), 3), cex=1.2)
  text(30, 85, names(cm2$byClass[2]), cex=1.2, font=2)
  text(30, 70, round(as.numeric(cm2$byClass[2]), 3), cex=1.2)
  text(50, 85, names(cm2$byClass[5]), cex=1.2, font=2)
  text(50, 70, round(as.numeric(cm2$byClass[5]), 3), cex=1.2)
  text(70, 85, names(cm2$byClass[6]), cex=1.2, font=2)
  text(70, 70, round(as.numeric(cm2$byClass[6]), 3), cex=1.2)
  text(90, 85, names(cm2$byClass[7]), cex=1.2, font=2)
  text(90, 70, round(as.numeric(cm2$byClass[7]), 3), cex=1.2)

  # add in the accuracy information 
  text(30, 35, names(cm2$overall[1]), cex=1.5, font=2)
  text(30, 20, round(as.numeric(cm2$overall[1]), 3), cex=1.4)
  text(70, 35, names(cm2$overall[2]), cex=1.5, font=2)
  text(70, 20, round(as.numeric(cm2$overall[2]), 3), cex=1.4)
}

draw_confusion_matrix(cm2)

According to the confusion matrix of KNN, the model's accuracy is 54%, the weakest of the three. Precision is 80% for the non-delay class and sensitivity is 58%, but specificity is only 37%, so the model identifies delays poorly.

Naive Bayes

draw_confusion_matrix <- function(cm3) {

  total <- sum(cm3$table)
  res <- as.numeric(cm3$table)

  # Generate color gradients. Palettes come from RColorBrewer.
  greenPalette <- c("#F7FCF5","#E5F5E0","#C7E9C0","#A1D99B","#74C476","#41AB5D","#238B45","#006D2C","#00441B")
  redPalette <- c("#FFF5F0","#FEE0D2","#FCBBA1","#FC9272","#FB6A4A","#EF3B2C","#CB181D","#A50F15","#67000D")
  getColor <- function (greenOrRed = "green", amount = 0) {
    if (amount == 0)
      return("#FFFFFF")
    palette <- greenPalette
    if (greenOrRed == "red")
      palette <- redPalette
    colorRampPalette(palette)(100)[10 + ceiling(90 * amount / total)]
  }

  # set the basic layout
  layout(matrix(c(1,1,2)))
  par(mar=c(2,2,2,2))
  plot(c(100, 345), c(300, 450), type = "n", xlab="", ylab="", xaxt='n', yaxt='n')
  title('CONFUSION MATRIX - Naive Bayes', cex.main=2)

  # create the matrix 
  classes = colnames(cm3$table)
  rect(150, 430, 240, 370, col=getColor("green", res[1]))
  text(195, 435, classes[1], cex=1.2)
  rect(250, 430, 340, 370, col=getColor("red", res[3]))
  text(295, 435, classes[2], cex=1.2)
  text(125, 370, 'Predicted', cex=1.3, srt=90, font=2)
  text(245, 450, 'Actual', cex=1.3, font=2)
  rect(150, 305, 240, 365, col=getColor("red", res[2]))
  rect(250, 305, 340, 365, col=getColor("green", res[4]))
  text(140, 400, classes[1], cex=1.2, srt=90)
  text(140, 335, classes[2], cex=1.2, srt=90)

  # add in the cm3 results
  text(195, 400, res[1], cex=1.6, font=2, col='white')
  text(195, 335, res[2], cex=1.6, font=2, col='white')
  text(295, 400, res[3], cex=1.6, font=2, col='white')
  text(295, 335, res[4], cex=1.6, font=2, col='white')

  # add in the specifics 
  plot(c(100, 0), c(100, 0), type = "n", xlab="", ylab="", main = "DETAILS", xaxt='n', yaxt='n')
  text(10, 85, names(cm3$byClass[1]), cex=1.2, font=2)
  text(10, 70, round(as.numeric(cm3$byClass[1]), 3), cex=1.2)
  text(30, 85, names(cm3$byClass[2]), cex=1.2, font=2)
  text(30, 70, round(as.numeric(cm3$byClass[2]), 3), cex=1.2)
  text(50, 85, names(cm3$byClass[5]), cex=1.2, font=2)
  text(50, 70, round(as.numeric(cm3$byClass[5]), 3), cex=1.2)
  text(70, 85, names(cm3$byClass[6]), cex=1.2, font=2)
  text(70, 70, round(as.numeric(cm3$byClass[6]), 3), cex=1.2)
  text(90, 85, names(cm3$byClass[7]), cex=1.2, font=2)
  text(90, 70, round(as.numeric(cm3$byClass[7]), 3), cex=1.2)

  # add in the accuracy information 
  text(30, 35, names(cm3$overall[1]), cex=1.5, font=2)
  text(30, 20, round(as.numeric(cm3$overall[1]), 3), cex=1.4)
  text(70, 35, names(cm3$overall[2]), cex=1.5, font=2)
  text(70, 20, round(as.numeric(cm3$overall[2]), 3), cex=1.4)
}

draw_confusion_matrix(cm3)

According to the confusion matrix of Naive Bayes, the model's accuracy is 70%, the highest of the three. Precision is 83% and sensitivity is 79% for the non-delay class. However, specificity is only 28.5%, which is low: the model misses most actual delays.



CONCLUSION

According to the analysis in the Exploratory Data Analysis section, Delta Air Lines has the lowest percentage of departure delays, and for each airline the highest percentage of departure delays falls on Saturday.

Overall, among the three models, the most accurate is Naive Bayes at about 70%, but its specificity is low, so it misses most actual delays. Between the KNN and logistic regression models, logistic regression has the better precision and accuracy.



FURTHER RESOURCES

Learn more about regression model applications using R with the following:



WORKS CITED

This code through references and cites the following sources: